ABSTRACT


Near-duplicate web documents are abundant. Two such 
documents differ from each other in a very small portion 
that displays advertisements, for example. Such differences 
are irrelevant for web search. So the quality of a web crawler 
increases if it can assess whether a newly crawled web page 
is a near-duplicate of a previously crawled web page or not. 

In the course of developing a near-duplicate detection sys- 
tem for a multi-billion page repository, we make two research 
contributions. First, we demonstrate that Charikar’s finger- 
printing technique is appropriate for this goal. Second, we 
present an algorithmic technique for identifying existing f- 
bit fingerprints that differ from a given fingerprint in at most 
k bit-positions, for small k. Our technique is useful for both 
online queries (single fingerprints) and batch queries (mul- 
tiple fingerprints). Experimental evaluation over real data 
confirms the practicality of our design. 


Categories and Subject Descriptors 


E.1 [Data Structures]: Distributed data structures; G.2.0 
[Discrete Mathematics]: General; H.3.3 [Information 
Search and Retrieval]: Search process 


General Terms 
Algorithms 


Keywords 


Hamming distance, near-duplicate, similarity, search, sketch, 
fingerprint, web crawl, web document 


1. INTRODUCTION 


Web crawling is an integral piece of infrastructure for 
search engines. Generic crawlers [1,9] crawl documents 
and links belonging to a variety of topics, whereas focused 
crawlers [27,43,46] use some specialized knowledge to limit 
the crawl to pages pertaining to specific topics. For web 
crawling, issues like freshness and efficient resource usage 
have previously been addressed [15,16,19]. However, the
problem of elimination of near-duplicate web documents in 
a generic crawl has not received attention. 


*Anish worked on this problem at Google in Dec 2005. 


Copyright is held by the International World Wide Web Conference Com- 
mittee (IW3C2). Distribution of these papers is limited to classroom use, 
and personal use by others. 

WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. 

ACM 978-1-59593-654-7/07/0005. 


Arvind Jain
Google Inc.
arvind@google.com

Anish Das Sarma*
Stanford University
anishds@stanford.edu


Documents that are exact duplicates of each other (due to 
mirroring and plagiarism) are easy to identify by standard 
checksumming techniques. A more difficult problem is the 
identification of near-duplicate documents. Two such docu- 
ments are identical in terms of content but differ in a small 
portion of the document such as advertisements, counters 
and timestamps. These differences are irrelevant for web 
search. So if a newly-crawled page $P_{duplicate}$ is deemed a
near-duplicate of an already-crawled page P, the crawl engine
should ignore $P_{duplicate}$ and all its out-going links (intuition
suggests that these are probably near-duplicates of
pages reachable from P). Elimination of near-duplicates¹
saves network bandwidth, reduces storage costs and im- 
proves the quality of search indexes. It also reduces the 
load on the remote host that is serving such web pages. 

A system for detection of near-duplicate pages faces a 
number of challenges. First and foremost is the issue of scale: 
search engines index billions of web-pages; this amounts to 
a multi-terabyte database. Second, the crawl engine should 
be able to crawl billions of web-pages per day. So the deci- 
sion to mark a newly-crawled page as a near-duplicate of an 
existing page should be made quickly. Finally, the system 
should use as few machines as possible. 


Our contributions: 

A. We show that Charikar’s simhash [17] is practically use- 
ful for identifying near-duplicates in web documents belong- 
ing to a multi-billion page repository. simhash is a finger- 
printing technique that enjoys the property that fingerprints 
of near-duplicates differ in a small number of bit positions. 
We experimentally validate that for a repository of 8B web- 
pages, 64-bit simhash fingerprints and k = 3 are reasonable. 

B. We develop a technique for solving the Hamming Dis- 
tance Problem: In a collection of f-bit fingerprints, quickly 
find all fingerprints that differ from a given fingerprint in at 
most k bit positions, where k is a small integer. Our tech- 
nique is useful for both online queries (single fingerprints) 
and batch queries (multiple fingerprints). 

C. We present a survey of algorithms and techniques for 
duplicate detection. 


Road-map: In §2, we discuss simhash. In §3, we present 
a technique for tackling the Hamming Distance Problem. 
In §4, we present experimental results. In §5, we present a 
survey of duplicate-detection techniques. 


¹In practice, presence/absence of near-duplicates may not
translate into a binary yes/no decision for eliminating pages 
from the crawl; instead, it may be used as one of a small 
number of scoring components that set the priority of a URL 
for crawling purposes. 
2. FINGERPRINTING WITH SIMHASH


Charikar’s simhash [17] is a dimensionality reduction tech- 
nique. It maps high-dimensional vectors to small-sized fin- 
gerprints. It is applied to web-pages as follows: we first con- 
vert a web-page into a set of features, each feature tagged 
with its weight. Features are computed using standard IR 
techniques like tokenization, case folding, stop-word removal, 
stemming and phrase detection. A set of weighted features 
constitutes a high-dimensional vector, with one dimension 
per unique feature in all documents taken together. With 
simhash, we can transform such a high-dimensional vector 
into an f-bit fingerprint where f is small, say 64. 


Computation: Given a set of features extracted from a 
document and their corresponding weights, we use simhash 
to generate an f-bit fingerprint as follows. We maintain an 
f-dimensional vector V, each of whose dimensions is initial- 
ized to zero. A feature is hashed into an f-bit hash value. 
These f bits (unique to the feature) increment/decrement
the f components of the vector by the weight of that feature 
as follows: if the i-th bit of the hash value is 1, the i-th com- 
ponent of V is incremented by the weight of that feature; 
if the i-th bit of the hash value is 0, the i-th component 
of V is decremented by the weight of that feature. When 
all features have been processed, some components of V are 
positive while others are negative. The signs of components 
determine the corresponding bits of the final fingerprint. 
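
The computation above can be sketched in a few lines of Python. This is a minimal illustration, not the C++ implementation used in our system: the helper names `feature_hash`, `simhash` and `hamming` are hypothetical, MD5 is an arbitrary stand-in for the per-feature hash, and mapping a zero component to a 0 bit is an assumption (the description above does not cover ties).

```python
import hashlib


def feature_hash(feature, f=64):
    """Hash a feature string to an f-bit integer (f <= 128).

    MD5 is an illustrative choice; any well-mixed f-bit hash would do.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - f)


def simhash(weighted_features, f=64):
    """Compute an f-bit simhash fingerprint from a {feature: weight} mapping."""
    v = [0.0] * f  # the f-dimensional vector V, initialized to zero
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight  # i-th hash bit is 1: increment by the weight
            else:
                v[i] -= weight  # i-th hash bit is 0: decrement by the weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:  # assumption: a component of exactly zero yields a 0 bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit-positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

In this sketch a low-weight feature cannot flip any bit determined by a much heavier feature, which is the intuition behind near-duplicates receiving nearby fingerprints.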


Empirical results: For our system, we used the original 
C++ implementation of simhash, done by Moses Charikar 
himself. Concomitant with the development of our system in 
2004-2005, Monika Henzinger conducted a study that compared
simhash with Broder’s shingle-based fingerprints [14].
An excellent comparison of these two approaches appears 
in Henzinger [35]. A great advantage of using simhash over 
shingles is that it requires relatively small-sized fingerprints. 
For example, Broder’s shingle-based fingerprints [14] require 
24 bytes per fingerprint (it boils down to checking whether 
two or more Rabin fingerprints out of six are identical). 
With simhash, for 8B web pages, 64-bit fingerprints suffice; 
we experimentally demonstrate this in §4. 


Properties of simhash: Note that simhash possesses two 
conflicting properties: (A) The fingerprint of a document is 
a “hash” of its features, and (B) Similar documents have 
similar hash values. The latter property is atypical of hash- 
functions. For illustration, consider two documents that dif- 
fer in a single byte. Then cryptographic hash functions like 
SHA-1 or MD5 will hash these two documents (treated as 
strings) into two completely different hash-values (the Ham- 
ming distance between the hash values would be large). 
However, simhash will hash them into similar hash-values 
(the Hamming distance would be small). 

In designing a near-duplicate detection system based on 
simhash, one has to deal with the quaintness of simhash de- 
scribed above. The strategy we employed is as follows: we 
design our algorithms assuming that Property A holds, i.e., 
the fingerprints are distributed uniformly at random, and we 
experimentally measure the impact of non-uniformity intro- 
duced by Property B on real datasets. 


After converting documents into simhash fingerprints, we 
face the following design problem: Given a 64-bit fingerprint 
of a recently-crawled web page, how do we quickly discover 
other fingerprints that differ in at most 3 bit-positions? We 
address this problem in the next Section. 




Session: Similarity Search 


3. THE HAMMING DISTANCE PROBLEM 


Definition: Given a collection of f-bit fingerprints and a 
query fingerprint F, identify whether an existing fingerprint 
differs from F in at most k bits. (In the batch-mode version 
of the above problem, we have a set of query fingerprints 
instead of a single query fingerprint). 

As a concrete instance of the above problem², consider a
collection of 8B 64-bit fingerprints, occupying 64GB. In the 
online version of the problem, for a query fingerprint F, we 
have to ascertain within a few milliseconds whether any of 
the existing 8B 64-bit fingerprints differs from F in at most 
k = 3 bit-positions. In the batch version of the problem, we 
have a set of, say, 1M query fingerprints (instead of a solitary 
query fingerprint F) and we have to solve the same problem 
for all 1M query fingerprints in roughly 100 seconds. This 
would amount to a throughput of 1B queries per day. 

Let us explore the design space by considering two simple- 
minded but impractical approaches. One approach is to 
build a sorted table of all existing fingerprints. Given F, we 
probe such a table with each F’ whose Hamming distance 
from F is at most k. The total number of probes is pro- 
hibitively large: for 64-bit fingerprints and k = 3, we need 
$\binom{64}{3} = 41664$ probes. An alternative is to pre-compute all
F’ such that some existing fingerprint is at most Hamming 
distance k away from F’. In this approach, the total number 
of pre-computed fingerprints is prohibitively large: it could 
be as many as 41664 times the number of fingerprints. 
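
The probe count quoted above is easy to check. The short sketch below (function names are illustrative) computes the number of 64-bit strings at Hamming distance exactly 3 from F, which is the 41664 figure, and also the slightly larger count of strings at distance at most 3.

```python
from math import comb


def exact_distance_count(f, k):
    """Number of f-bit strings at Hamming distance exactly k from a given string."""
    return comb(f, k)


def ball_size(f, k):
    """Number of f-bit strings at Hamming distance at most k (distance 0 included)."""
    return sum(comb(f, j) for j in range(k + 1))


print(exact_distance_count(64, 3))  # 41664
print(ball_size(64, 3))             # 43745
```

Either way, tens of thousands of probes per query (or a comparable blowup in stored fingerprints) is what makes both simple-minded approaches impractical.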

We now develop a practical algorithm that lies in between 
the two approaches outlined above: it is possible to solve the 
problem with a small number of probes and by duplicating 
the table of fingerprints by a small factor. 


Intuition: Consider a sorted table of $2^d$ f-bit truly random
fingerprints. Focus on just the most significant d bits
in the table. A listing of these d-bit numbers amounts to
“almost a counter” in the sense that (a) quite a few $2^d$
bit-combinations exist, and (b) very few d-bit combinations are
duplicated. On the other hand, the least significant $f - d$
bits are “almost random”.

Now choose d’ such that $|d' - d|$ is a small integer. Since
the table is sorted, a single probe suffices to identify all fingerprints
which match F in the d’ most significant bit-positions.
Since $|d' - d|$ is small, the number of such matches is also
expected to be small. For each matching fingerprint, we can
easily figure out if it differs from F in at most k bit-positions
or not (these differences would naturally be restricted to the
$f - d'$ least-significant bit-positions).

The procedure described above helps us locate an existing 
fingerprint that differs from F in k bit-positions, all of which 
are restricted to be among the least significant $f - d'$ bits of
F. This takes care of a fair number of cases. To cover all 
the cases, it suffices to build a small number of additional 
sorted tables, as formally outlined in the next Section. 


3.1 Algorithm for Online Queries 


We build t tables: $T_1, T_2, \ldots, T_t$. Associated with table $T_i$
are two quantities: an integer $p_i$ and a permutation $\pi_i$ over
the f bit-positions. Table $T_i$ is constructed by applying
permutation $\pi_i$ to each existing fingerprint; the resulting
set of permuted f-bit fingerprints is sorted. Further, each
table is compressed (see §3.2) and stored in main-memory


²Please note that the numerical values chosen for the online
and the batch versions are for illustrative purposes only. 
of a set of machines. Given fingerprint F and an integer k,
we probe these tables in parallel: 


Step 1: Identify all permuted fingerprints in $T_i$ whose top
$p_i$ bit-positions match the top $p_i$ bit-positions of $\pi_i(F)$.


Step 2: For each of the permuted fingerprints identified in
Step 1, check if it differs from $\pi_i(F)$ in at most k bit-positions.


In Step 1, identification of the first fingerprint in table $T_i$
whose top $p_i$ bit-positions match the top $p_i$ bit-positions of
$\pi_i(F)$ can be done in $O(p_i)$ steps by binary search. If we assume
that each fingerprint were truly a random bit sequence,
interpolation search shrinks the run time to $O(\log p_i)$ steps
in expectation [52].
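
Steps 1 and 2 can be sketched as follows. This is a simplified, single-machine illustration under stated assumptions: it uses k + 1 equal blocks (the 4-table design for f = 64, k = 3), takes each permutation to be a rotation that brings one block to the leading position, keeps tables uncompressed, and uses plain binary search via `bisect` rather than interpolation search. The class name `HammingIndex` and the helper `rotate_left` are hypothetical.

```python
import bisect

F_BITS = 64
MASK = (1 << F_BITS) - 1


def rotate_left(x, r):
    """Rotate an F_BITS-wide integer left by r positions (a bit permutation)."""
    r %= F_BITS
    return ((x << r) | (x >> (F_BITS - r))) & MASK


class HammingIndex:
    """k + 1 sorted tables; table j has block j rotated into the top p bits."""

    def __init__(self, fingerprints, k=3):
        self.k = k
        self.p = F_BITS // (k + 1)  # number of leading bits matched in Step 1
        self.tables = []
        for j in range(k + 1):
            rot = j * self.p
            table = sorted(rotate_left(fp, rot) for fp in fingerprints)
            self.tables.append((rot, table))

    def query(self, fp):
        """Return all stored fingerprints within Hamming distance k of fp."""
        matches = set()
        low_bits = F_BITS - self.p
        for rot, table in self.tables:
            q = rotate_left(fp, rot)
            prefix = q >> low_bits
            # Step 1: contiguous run of entries sharing the top p bits with q.
            lo = bisect.bisect_left(table, prefix << low_bits)
            hi = bisect.bisect_left(table, (prefix + 1) << low_bits)
            # Step 2: verify the full Hamming distance for each candidate.
            for cand in table[lo:hi]:
                if bin(cand ^ q).count("1") <= self.k:
                    matches.add(rotate_left(cand, F_BITS - rot))  # undo rotation
        return matches
```

With k + 1 disjoint blocks and at most k differing bits, at least one block is free of differences, so every near-duplicate is found in at least one table.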


3.1.1 Exploration of Design Parameters 


Let us see how a reasonable combination of t and $p_i$ can be
fixed. We have two design goals: (1) a small set of permutations
to avoid blowup in space requirements; and (2) large
values for various $p_i$ to avoid checking too many fingerprints
in Step 2. Recall that if we seek all (permuted) fingerprints
which match the top $p_i$ bits of a given (permuted) fingerprint,
we expect $2^{d - p_i}$ fingerprints as matches. Armed with
this insight, we present some examples for f = 64 and k = 3.
We present an analytic solution in §3.1.2.


EXAMPLE 3.1. Consider f = 64 (64-bit fingerprints), and
k = 3 so near-duplicates’ fingerprints differ in at most 3 bit-positions.
Assume we have 8B = $2^{34}$ existing fingerprints,
i.e. d = 34. Here are four different designs; each design has
a different set of permutations and $p_i$ values.


20 tables: Split 64 bits into 6 blocks having 11, 11, 11,
11, 10 and 10 bits respectively. There are $\binom{6}{3} = 20$
ways of choosing 3 out of these 6 blocks. For each
such choice, permutation π corresponds to making the
bits lying in the chosen blocks the leading bits (there
are several such permutations; we choose one of them
uniformly at random). The value of $p_i$ is the total
number of bits in the chosen blocks. Thus $p_i$ = 31, 32
or 33. On average, a probe retrieves at most $2^{34-31} = 8$
(permuted) fingerprints.


16 tables: Split 64 bits into 4 blocks, each having 16 bits.
There are $\binom{4}{1} = 4$ ways of choosing 1 out of these 4
blocks. For each such choice, we divide the remaining
48 bits into four blocks having 12 bits each. There are
$\binom{4}{1} = 4$ ways of choosing 1 out of these 4 blocks,
giving 4 × 4 = 16 tables in all. The
permutation for a table corresponds to placing the bits
in the chosen blocks in the leading positions. The value
of $p_i$ is 28 for all blocks. On average, a probe retrieves
$2^{34-28} = 64$ (permuted) fingerprints.


10 tables: Split 64 bits into 5 blocks having 13, 13, 13, 13
and 12 bits respectively. There are $\binom{5}{2} = 10$ ways of
choosing 2 out of these 5 blocks. For each such choice,
permutation π corresponds to making the bits lying in
the chosen blocks the leading bits. The value of $p_i$ is
the total number of bits in the chosen blocks. Thus
$p_i$ = 25 or 26. On average, a probe retrieves at most
$2^{34-25} = 512$ (permuted) fingerprints.


4 tables: Split 64 bits into 4 blocks, each having 16 bits.
There are $\binom{4}{1} = 4$ ways of choosing 1 out of these
4 blocks. For each such choice, permutation π corresponds
to making the bits lying in the chosen blocks
the leading bits. The value of $p_i$ is the total number of
bits in the chosen blocks. Thus $p_i$ = 16. On average,
a probe retrieves at most $2^{34-16}$ = 256K (permuted)
fingerprints.
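
The arithmetic behind the four designs can be reproduced with a small sketch. The function `design` is a hypothetical helper: given the block widths and the number of blocks chosen as leading bits, it returns the number of tables, the smallest $p_i$, and the corresponding upper bound $2^{d - p_i}$ on the expected number of matches per probe, with d = 34 as in Example 3.1.

```python
from itertools import combinations

D = 34  # number of existing fingerprints is taken to be 2**D, as in Example 3.1


def design(blocks, choose):
    """Return (number of tables, smallest p_i, max expected matches per probe)."""
    p_values = [sum(chosen) for chosen in combinations(blocks, choose)]
    p_min = min(p_values)
    return len(p_values), p_min, 2 ** (D - p_min)


print(design([11, 11, 11, 11, 10, 10], 3))  # (20, 31, 8)
print(design([13, 13, 13, 13, 12], 2))      # (10, 25, 512)
print(design([16, 16, 16, 16], 1))          # (4, 16, 262144)

# The 16-table design is two-level: 1 of 4 blocks of 16 bits, then 1 of 4 blocks of 12.
n1, p1, _ = design([16] * 4, 1)
n2, p2, _ = design([12] * 4, 1)
print(n1 * n2, p1 + p2, 2 ** (D - (p1 + p2)))  # 16 28 64
```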


3.1.2 Optimal Number of Tables 


Example 3.1 shows that many different design choices are
possible for a fixed choice of f and k. Increasing the number
of tables increases $p_i$ and hence reduces the query time.
Decreasing the number of tables reduces storage requirements,
but reduces $p_i$ and hence increases the query time.

A reasonable approach to fix the trade-off between space
and time is to ask the following question: How many tables
do we need if we restrict the minimum value of $p_i$ to
some constant? For a fixed number of documents $2^d$, size of
fingerprint f, and maximum allowed Hamming distance k,
the general solution to the problem is given by the following
expression:

$$
X(f, k, d) =
\begin{cases}
1 & \text{if } d < \tau \\
\min_{r > k} \binom{r}{k} \cdot X\!\left(\frac{fk}{r},\, k,\, d - \frac{(r-k)f}{r}\right) & \text{otherwise}
\end{cases}
$$

where X(f, k, d) represents the number of tables required,
and the threshold τ is determined by the minimum
allowed value of $p_i$: if the minimum value is $p_{min}$, then
$\tau = d - p_{min}$.
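
The recurrence for X(f, k, d) can be evaluated directly. The sketch below is one reading of it, under assumptions that are not part of the formula itself: block widths are treated as real numbers (f/r need not be an integer), the number of blocks r is capped at a hypothetical `max_r`, and splits whose blocks would be narrower than one bit are disallowed. The function name `num_tables` is illustrative.

```python
import math
from math import comb


def num_tables(f, k, d, tau, max_r=12):
    """Evaluate X(f, k, d) for threshold tau, trying r = k+1 .. max_r blocks."""
    if d < tau:
        return 1  # few enough candidates remain: one table suffices
    best = math.inf  # stays infinite if no admissible split exists
    for r in range(k + 1, max_r + 1):
        if f / r < 1:
            break  # assumption: a block must be at least one bit wide
        leading = (r - k) * f / r  # bits moved to the leading positions
        sub = num_tables(f * k / r, k, d - leading, tau, max_r)
        best = min(best, comb(r, k) * sub)
    return best


print(num_tables(64, 3, 34, tau=3))   # 20
print(num_tables(64, 3, 34, tau=18))  # 10
```

With f = 64, k = 3 and d = 34, a threshold of τ = 3 (that is, $p_{min}$ = 31) yields 20 tables under this reading, matching the first design of Example 3.1.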

Alternately, one could ask what the maximum value of p; 
is if we restrict the total number of tables to some number. 
This problem can be solved similarly. 


3.2 Compression of Fingerprints 


Compression can shrink the sizes of individual tables. For 
example, table sizes for 8B documents and 64-bit finger- 
prints can be shrunk to approximately half their sizes. 

The main insight is that successive fingerprints share the 
top d bits in expectation. We exploit this fact as follows. 
Let h denote the position of the most-significant 1-bit in
the XOR of two successive fingerprints. Thus h takes values
between 0 and $f - 1$. For a given table, we first compute
the distribution of h values and then compute a Huffman
code [37] over $[0, f - 1]$ for this distribution. Next, we choose
a parameter B denoting the block size. A typical value for 
B would be 1024 bytes. A block with B bytes has 8B bits. 
We scan the sorted sequence of (permuted) fingerprints in a 
table and populate successive blocks as follows: 


Step 1: The first fingerprint in the block is remembered in
its entirety. This consumes f bits. Thereafter, Step
2 is repeated for successive fingerprints until a block is
full, i.e., we cannot carry out Step 2 without needing
8B + 1 bits or more.


Step 2: Compute the XOR of the current fingerprint with 
the previous fingerprint. Find the position of the most- 
significant 1-bit. Append the Huffman code for this 
bit-position to the block. Then append the bits to the 
right of the most-significant 1-bit to the block. 
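
The two steps above can be sketched as an encoder/decoder pair. This is a simplified illustration under stated assumptions: a fixed 6-bit code for h stands in for the Huffman code, the bit stream is modeled as a Python string of '0'/'1' characters, and the fingerprints in a block are assumed to be distinct (so each XOR is non-zero). The names `encode_block` and `decode_block` are hypothetical.

```python
F_BITS = 64
H_BITS = 6  # fixed-width code for h in [0, 63]; a Huffman code would replace this


def encode_block(sorted_fps, capacity_bits):
    """Pack a prefix of sorted, distinct fingerprints into one block.

    Returns (bit string, number of fingerprints stored in the block).
    """
    bits = format(sorted_fps[0], f"0{F_BITS}b")  # Step 1: first fingerprint in full
    count = 1
    for prev, cur in zip(sorted_fps, sorted_fps[1:]):
        h = (prev ^ cur).bit_length() - 1  # position of most-significant 1-bit of XOR
        low = format(cur & ((1 << h) - 1), f"0{h}b") if h else ""
        piece = format(h, f"0{H_BITS}b") + low  # Step 2: code for h, then bits right of h
        if len(bits) + len(piece) > capacity_bits:
            break  # block is full
        bits += piece
        count += 1
    return bits, count


def decode_block(bits, count):
    """Invert encode_block, recovering the stored fingerprints in order."""
    fps = [int(bits[:F_BITS], 2)]
    pos = F_BITS
    for _ in range(count - 1):
        h = int(bits[pos:pos + H_BITS], 2)
        pos += H_BITS
        low = int(bits[pos:pos + h], 2) if h else 0
        pos += h
        high = (fps[-1] >> (h + 1)) << (h + 1)  # bits above h are shared with predecessor
        fps.append(high | (1 << h) | low)       # bit h is 1 because the table is sorted
    return fps
```

Decoding relies on the table being sorted: the current fingerprint agrees with its predecessor above position h and must have a 1 at position h, so only the bits to the right of h need to be stored.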


The key associated with a block is the last fingerprint that 
was remembered in that block. When a (permuted) finger- 
print arrives, an interpolation search [52] on the keys helps 
us figure out which block to decompress. Depending upon 
the value of $p_i$ and d, and on the distribution of fingerprints
(simhash tends to cluster similar documents together), we 
occasionally have to decompress multiple blocks. 
3.3 Algorithm for Batch Queries


As mentioned at the beginning of §3, in the batch version 
of the Hamming Distance Problem, we have a batch of query 
fingerprints instead of a solitary query fingerprint. 

Assume that existing fingerprints are stored in file F and 
that the batch of query fingerprints are stored in file Q. With 
8B 64-bit fingerprints, file F will occupy 64GB. Compression 
(see §3.2) shrinks the file size to less than 32GB. A batch 
has of the order of 1M fingerprints, so let us assume that file 
Q occupies 8MB. 

At Google, for example, files F and Q would be stored in a 
shared-nothing distributed file system called GFS [29]. GFS 
files are broken into 64MB chunks. Each chunk is replicated 
at three (almost) randomly chosen machines in a cluster; 
each chunk is stored as a file in the local file system. 

Using the MapReduce framework [24], the overall compu- 
tation can be split conveniently into two phases. In the first 
phase, there are as many computational tasks as the number 
of chunks of F (in MapReduce terminology, such tasks are 
called mappers). Each task solves the Hamming Distance 
Problem over some 64-MB chunk of F and the entire file Q 
as inputs. A list of near-duplicate fingerprints discovered 
by a task is produced as its output. In the second phase, 
MapReduce collects all the outputs, removes duplicates and 
produces a single sorted file. 

We would like to mention a couple of points about effi- 
ciency. First, MapReduce strives to maximize locality, i.e., 
most mappers are co-located with machines that hold the 
chunks assigned to them; this avoids shipping chunks over 
the network. Second, file Q is placed in a GFS directory 
with replication factor far greater than three. Thus copy- 
ing file Q to various mappers does not become a bottleneck 
(please see the GFS paper for a discussion of this issue). 

How do we solve the Hamming Distance Problem with 
file Q and a 64-MB chunk of file F? We build tables, as 
outlined in §3.1, corresponding to file Q (note that for the 
online mode, the tables were built for file F). Since each 
individual uncompressed table occupies 8MB, we can eas- 
ily build 10 such tables in main memory, without worrying 
about compression. After building the tables, we scan the 
chunk sequentially, probing the tables for each fingerprint 
encountered in the scan. 
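
A single-process sketch of the batch computation follows. It is an illustration under assumptions, not the MapReduce job itself: dictionaries keyed on block contents stand in for the sorted, permuted tables of §3.1, the 4-block design for k = 3 is used, and the functions `build_query_tables`, `scan_chunk` and `merge_outputs` are hypothetical names for the table-building step, the per-chunk task of the first phase, and the second phase respectively.

```python
from collections import defaultdict

F_BITS, K = 64, 3
BLOCK = F_BITS // (K + 1)       # 16-bit blocks
BLOCK_MASK = (1 << BLOCK) - 1


def build_query_tables(queries):
    """Index the batch of query fingerprints (file Q) once per block."""
    tables = [defaultdict(list) for _ in range(K + 1)]
    for q in queries:
        for j, table in enumerate(tables):
            table[(q >> (j * BLOCK)) & BLOCK_MASK].append(q)
    return tables


def scan_chunk(chunk, tables):
    """First phase: scan one chunk of existing fingerprints sequentially.

    Returns the set of (existing, query) pairs at Hamming distance <= K.
    """
    found = set()
    for fp in chunk:
        for j, table in enumerate(tables):
            for q in table.get((fp >> (j * BLOCK)) & BLOCK_MASK, ()):
                if bin(fp ^ q).count("1") <= K:
                    found.add((fp, q))
    return found


def merge_outputs(outputs):
    """Second phase: collect all outputs, remove duplicates, and sort."""
    return sorted(set().union(*outputs))
```

Building the tables over Q rather than F keeps them small enough to hold uncompressed in memory, while each chunk of F is read exactly once.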


3.4 Previous Work 


A generalized version of the Hamming Distance Problem 
was first proposed by Minsky and Papert [44]: Given a set 
of n f-bit strings (chosen by an adversary), and a string F, 
the goal is to identify strings in the set which differ from F 
in at most d bit-positions. No efficient solutions are known 
for general n, f and d. A theoretical study was initiated 
by Yao and Yao [53], who developed an efficient algorithm 
for d = 1. Their algorithm was improved by Brodal and 
Gasienec [10] and Brodal and Venkatesh [11]. For large d, 
some progress is reported by Greene, Parnas and Yao [31], 
Dolev et al [28] and Arslan and Egecioglu [3]. 

Our problem differs from the one addressed by the theory 
community in two aspects. First, we assume that the in- 
put consists of bit-strings chosen uniformly at random (with 
some non-uniformity introduced by simhash which hashes 
similar documents to similar values). Second, we deal with 
a very large number of bit-strings that do not fit in the main 
memory of one machine; this limits us to simple external 
memory algorithms that work well in a distributed setting. 
[Figure 1: Precision vs recall for various k. Plot title: Precision-Recall Graph Varying k for 64-bit SimHash; x-axis: Hamming distance k (1 to 9); y-axis: precision and recall.]


4. EXPERIMENTAL RESULTS 


No previous work has studied the trade-off between f and 
k for the purpose of detection of near-duplicate web-pages 
using simhash. So our first goal was to ascertain whether 
simhash is a reasonable fingerprinting technique for near- 
duplicate detection in the first place. We study simhash in 
§4.1. Next, we wanted to make sure that the clusters pro- 
duced by simhash do not impact our algorithms significantly. 
We analyze distributions of fingerprints in §4.2. Finally, we 
touch upon running times and scalability issues in §4.3. 


4.1 Choice of Parameters 


We experimented with $2^{34}$ = 8B simhash fingerprints. We
varied k from 1 to 10. For each k, we randomly sampled an 
equal number of pairs of fingerprints that are at Hamming 
distance exactly k. We manually tagged each pair as: (1) 
true positive; (2) false positive; or (3) unknown. We used 
guidelines from [35] for deciding which of the three categories 
to put a pair in — radically different pairs are false positive; 
pages that differ slightly, such as only in counters, ads, or 
timestamps are true positive; and, pages that cannot be 
evaluated, e.g., because of content in non-English language, 
or because a login is needed to access the page, are tagged 
as unknown. 

Figure 1 plots the precision-recall graph for our experiments.
Precision is defined as the fraction of reported near-duplicates
(i.e., having Hamming distance at most k) that
are true positives. Recall denotes the fraction of the total 
number of near-duplicate pairs (in our sample) that get de- 
tected with Hamming distance at most k. 
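
The two definitions can be made concrete with a short sketch. The counts below are made-up illustrative numbers, not our experimental data; `tagged` is a hypothetical mapping from Hamming distance to the number of sampled pairs tagged true positive and false positive (pairs tagged unknown are simply left out).

```python
def precision_recall(tagged, k):
    """Precision and recall when pairs at Hamming distance <= k are reported.

    tagged maps distance -> (true_positives, false_positives) among sampled pairs.
    Assumes at least one pair is reported and at least one true positive exists.
    """
    tp_reported = sum(tp for dist, (tp, fp) in tagged.items() if dist <= k)
    fp_reported = sum(fp for dist, (tp, fp) in tagged.items() if dist <= k)
    tp_total = sum(tp for tp, fp in tagged.values())
    precision = tp_reported / (tp_reported + fp_reported)
    recall = tp_reported / tp_total
    return precision, recall


# Hypothetical tagging results for 10 sampled pairs at each distance.
sample = {1: (9, 1), 2: (8, 2), 3: (7, 3), 4: (4, 6), 5: (2, 8)}
print(precision_recall(sample, 1))  # (0.9, 0.3)
print(precision_recall(sample, 3))  # (0.8, 0.8)
print(precision_recall(sample, 5))  # (0.6, 1.0)
```

Raising k can only increase recall, while precision falls as more dissimilar pairs are admitted.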

Figure 1 clearly shows the trade-offs for various values of 
k: A very low value misses near-duplicates (false negatives), 
and a very high value tags incorrect pairs as near-duplicates 
(false positives). Choosing k = 3 is reasonable because both 
precision and recall are near 0.75. So, for 64-bit fingerprints, 
declaring two documents as near-duplicates when their fin- 
gerprints differ in at most 3 bits gives fairly high accuracy. 


4.2 Distribution of Fingerprints 

[Figure 2: Analysis of fingerprint distributions. (a) Distribution of leading 1-bit positions. (b) Bucketization of fingerprints (distribution of 64-bit fingerprints; y-axis: bucket frequency).]

We designed our algorithm assuming that simhash fingerprints
of documents over the web are uniformly random.
However, simhash tends to cluster similar documents together.
Figure 2(a) illustrates this phenomenon quantitatively.
In Figure 2(a), we plot the distribution of bit-positions
of the leading 1-bit in XORs of successive fingerprints. If the
fingerprints were truly random, we would have seen a symmetric
distribution which would decay exponentially (the y-value
would diminish by half for every increment/decrement
of the x-value). Note that the right-half of the distribution 
indeed exhibits this behavior. However, the left-half of the 
distribution does not drop off rapidly; there is significant 
density. This is clearly due to clustering of documents; there 
are pairs of documents whose simhash values differ by a mod- 
erate number of bits because they contain similar content. 

In Figure 2(b), we plot the distribution of fingerprints 
in 128 buckets; bucket boundaries are defined by dividing 
the space of $2^f$ fingerprints into 128 equi-sized contiguous
intervals. Fingerprints are more or less equally spaced out. 
Curiously, some spikes exist. These occur due to a variety 
of reasons. Some examples: (i) several pages are empty; 
all of these have simhash value of 0, (ii) there are several 
instances of “File not Found” pages, and (iii) many websites 
use the same bulletin board software; the login pages for 
these websites are similar. 


4.3 Scalability 


For the batch mode algorithm, a compressed version of
file F occupies almost 32GB (as compared with 64GB uncompressed).
With 200 mappers, we can scan chunks at a
combined rate of over 1GBps. So the overall computation
finishes in fewer than 100 seconds. Compression plays an
important role in speedup because for a fixed number of
mappers, the time taken is roughly proportional to the size
of file F.


5. DUPLICATE DETECTION: A SURVEY 


A variety of techniques have been developed to identify 
pairs of documents that are “similar” to each other. These 
differ in terms of the end goal, the corpus under consideration,
the feature-set identified per document and the signature
scheme for compressing the feature-set. In this Section,
we present a comprehensive review of near-duplicate
detection systems. In the process of summarizing the over- 
all design-space, we highlight how our problem differs from 
earlier work and why it merits a simhash-based approach. 


5.1 The nature of the corpus 


Broadly speaking, duplicate-detection systems have been 
developed for four types of document collections: 


a) Web Documents: Near-duplicate systems have been
developed for finding related-pages [25], for extracting
structured data [2], and for identifying web mirrors [6,7].

b) Files in a file system: Manber [42] develops algorithms
for near-duplicate detection to reduce storage for files.
The Venti file system [48] and the Low-bandwidth file
system [45] have similar motivations.

c) E-mails: Kolcz et al [40] identify near-duplicates for
spam detection.

d) Domain-specific corpora: Various groups have developed
near-duplicate detection systems for legal documents
(see Conrad and Schriber [22]), TREC benchmarks,
Reuters news articles, and Citeseer data.


Our work falls into the first category (Web Documents). 
We experimented with 8B pages; this is far larger than
collection sizes tackled by previous studies: web-clustering 
by Broder et al [14] (30M URLs in 1996), “related pages” 
by Dean and Henzinger [25] (180M URLs in 1998), web- 
clustering by Haveliwala et al [33] (35M URLs in 2000). 


5.2 The end goal: why detect duplicates? 


a) Web mirrors: For web search, successful identification 
of web mirrors results in smaller crawling/storage/indexing 
costs in the absence of near-duplicates, better top-k re- 
sults for search queries, improvement in page-rank by re- 
ducing the in-degree of sites resulting from near-duplicates, 
cost-saving by not asking human evaluators to rank near- 
duplicates. See Bharat et al [6,7] for a comparison of 
techniques for identifying web-mirrors. 
b) Clustering for “related documents” query: For example,
given a news article, a web-surfer might be interested
in locating news articles from other sources that
report the same event. The notion of “similarity” is at 
a high level — one could say that the notion of similar- 
ity is “semantic” rather than “syntactic”, quite differ- 
ent from the notion of duplicates or near-duplicates dis- 
cussed above. One approach is to use Latent Semantic 
Indexing [26]. Another approach is to exploit the link- 
age structure of the web (see Dean and Henzinger [25] 
who build upon Kleinberg’s idea of hubs and authori- 
ties [39]). Going further along these lines, Kumar et al 
[41] have proposed discovering “online communities” by 
identifying dense bipartite sub-graphs of the web-graph. 


c) Data extraction: Given a moderate-sized collection of
similar pages, say reviews at www.imdb.com, the goal is 
to identify the schema/DTD underlying the collection 
so that we can extract and categorize useful informa- 
tion from these pages. See Joshi et al [38] (and refer- 
ences therein) for a technique that clusters web-pages on 
the basis of structural similarity. See Arasu and Garcia- 
Molina [2] for another technique that identifies templates 
underlying pages with similar structure. Also note that 
metadata (HTML tags) was ignored in a) and b) above. 


d) Plagiarism: Given a set of reports, articles or assignment-submissions
(both source-code and textual reports), the
goal is to identify pairs of documents that seem to have 
borrowed from each other significantly. For some early 
work in this area, see articles by Baker [4,5], the COPS 
system by Brin et al [8] and SCAM by Shivakumar and 
Garcia-Molina [51]. 

e) Spam detection: Given a large number of recently-
received e-mails, the goal is to identify SPAM before 
depositing the e-mail into recipients’ mailboxes. The 
premise is that spammers send similar e-mails en masse, 
with small variation in the body of these e-mails. See 
Kolcz et al [40], who build upon previous work by Chowd- 
hury et al [20]. 


f) Duplicates in domain-specific corpora: The goal is
to identify near-duplicates arising out of revisions, modi- 
fications, copying or merger of documents, etc. See Con- 
rad and Schriber [22] for a case-study involving legal doc- 
uments at a firm. Manber [42] initiated an investigation 
into identification of similar files in a file system. 


Our near-duplicate detection system improves web crawling, 
a goal not shared with any of the systems described above. 


5.3 Feature-set per document 


a) Shingles from page content: Consider the sequence
of words in a document. A shingle is the hash-value of 
a k-gram which is a sub-sequence of k successive words. 
The set of shingles constitutes the set of features of a 
document. The choice of k is crucial (see footnote 3). Hashes of
successive k-grams can be efficiently computed by using
Rabin’s fingerprinting technique [49]. Manber [42] created
shingles over characters. The COPS system by Brin et 
al [8] used sentences for creating shingles. Broder et al 
[12, 14] created shingles over words. The total number 
of shingles per document is clearly large. Therefore, a
small-sized signature is computed over the set of
shingles, as described in the next sub-section.

Footnote 3: Small k makes dissimilar documents appear similar. Large
k makes similar documents appear dissimilar.


b) Document vector from page content: In contrast 
to shingles, a document can be characterized by deploy- 
ing traditional IR techniques. The idea is to compute its 
“document-vector” by case-folding, stop-word removal, 
stemming, computing term-frequencies and finally,
weighting each term by its inverse document frequency (IDF).
Next, given two documents, a “measure” of similarity 
is defined. Hoad and Zobel [36] argue that the tradi- 
tional cosine-similarity measure is inadequate for near- 
duplicate detection. They define and evaluate a vari- 
ety of similarity measures (but they do not develop any 
signature-scheme to compress the document-vectors). 


A different approach is taken by Chowdhury et al [20] 
who compute a lexicon (the union of all terms existing 
in the collection of documents). The lexicon is then 
pruned (a variety of schemes are studied by the au- 
thors). Each document-vector is then modified by re- 
moving terms that have been pruned from the lexicon. 
The resulting document-vectors are fingerprinted. Two 
documents are said to be near-duplicates iff their fin- 
gerprints match. This scheme is rather brittle for near- 
duplicate detection — a follow-up paper [40] ameliorates 
the problem by constructing multiple lexicons (these are 
random subsets of the original lexicon). Now multiple 
fingerprints per document are computed and two docu- 
ments are said to be duplicates iff most of their finger- 
prints match. 


An issue to keep in mind when dealing with document- 
vectors is that the IDF of any term is global information 
which changes as the collection changes. 


c) Connectivity information: For the purpose of finding 
“related pages”, Dean and Henzinger [25] exploited the 
linkage structure of the web. The premise is that simi- 
lar pages would have several incoming links in common. 
Haveliwala et al [34] point out that the quality of dupli- 
cate detection is poor for pages with very few incoming 
links. This can be ameliorated by taking anchor text and 
anchor windows into account. 


d) Anchor text, anchor window: Similar documents 
should have similar anchor text. Haveliwala et al [34] 
study the impact of anchor-text and anchor-windows, 
where an anchor-window is the text surrounding the anchor- 
text, for example, the paragraph it belongs to. The 
words in the anchor text/window are folded into the 
document-vector itself. A weighting function that
diminishes the weight of words that are farther away from the
anchor text is shown to work well. 


e) Phrases: Cooper et al [23] propose identification of 
phrases using a phrase-detection system and computing 
a document-vector that includes phrases as terms. They 
have tested their ideas on a very small collection (tens
of thousands of documents). The idea of using phrases also appears
in the work of Hammouda and Kamel [32] who build 
sophisticated indexing techniques for web-clustering. 
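
The effect of k on word-level shingles, as in item a) above, can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses a truncated SHA-1 digest as a stand-in for Rabin fingerprints; the function names `shingles` and `jaccard` and the two sample sentences are hypothetical and are not taken from any of the cited systems.

```python
import hashlib


def shingles(text, k):
    """Return the set of hashed k-grams (shingles) over the words of text."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    # Assumption: a 64-bit prefix of SHA-1 stands in for Rabin fingerprints.
    return {
        int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:8], "big")
        for g in grams
    }


def jaccard(a, b):
    """Fraction of features shared by two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

# The two documents differ in a single word; overlap drops as k grows.
for k in (1, 2, 4):
    overlap = jaccard(shingles(doc1, k), shingles(doc2, k))
    print(k, round(overlap, 3))
```

With a single changed word, the overlap between the two shingle sets falls as k increases, which matches the observation that large k makes similar documents appear dissimilar.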


We chose to work with the document vector model; simhash 
converts document vectors into fingerprints. Augmenting 
the document vector by other signals (anchor text and con- 
nectivity information, for example) might improve the qual- 
ity of our system. We leave these possibilities as future work. 
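
As an illustration of the document vector model, the following sketch builds TF-IDF weighted vectors and compares them with cosine similarity. It is not the feature pipeline of our system: the stop-word list, the smoothed IDF formula and the names `tokenize`, `document_vectors` and `cosine` are assumptions made for the example, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on"}


def tokenize(text):
    # Case-folding and stop-word removal; stemming is omitted in this sketch.
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def document_vectors(corpus):
    """Return one {term: tf * idf} dictionary per document in corpus."""
    token_lists = [tokenize(doc) for doc in corpus]
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


corpus = [
    "the cat sat on the mat",
    "the cat sat on a mat today",
    "stock prices fell sharply in asia",
]
vecs = document_vectors(corpus)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))

# IDF is global information: adding a document changes existing vectors.
bigger = document_vectors(corpus + ["cat videos"])
print(vecs[0]["cat"], bigger[0]["cat"])
```

The last two lines show the issue noted in item b): the weight of a term in an existing document shifts when the collection grows, because its IDF is recomputed.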

5.4 Signature schemes


a) Mod-p shingles: A simple compression scheme for shingle-
based fingerprints is to retain only those fingerprints
whose remainder modulo p is 0, for a sufficiently large
value of p. The number of fingerprints retained is variable.
Moreover, it is important to ignore commonly-
occurring fingerprints since they contribute to false-matches.

A drawback of this scheme is that the distance between
successive shingles that are retained is unbounded. This
problem has been ameliorated by the “winnowing” tech-
nique by Schleimer et al [50]. Hoad and Zobel [36]
compare a variety of other ideas for pruning the set of
shingle-based fingerprints.


b) Min-hash for Jaccard similarity of sets: For two
sets A and B, let the measure of similarity be $|A \cap B| / |A \cup B|$,
also known as the Jaccard measure. Interestingly, it is
possible to devise a simple signature scheme such that 
the probability that the signatures of A and B match is 
exactly the Jaccard measure [13, 14]. 


Several experimental studies have tested the efficacy of
min-hash in various settings (Cohen et al [21] for association-
rule mining, Chen et al [18] for selectivity estimation of
boolean queries, Gionis et al [30] for indexing set-value 
predicates and Haveliwala [33] for web-clustering). 


c) Signatures/fingerprints over IR-based document
vectors: Charikar’s simhash [17] is a fingerprinting tech- 
nique for compressing document vectors such that two 
fingerprints are similar iff the document vectors are sim- 
ilar. Another technique for computing signatures over 
document-vectors is the I-Match algorithm by Chowd- 
hury et al [20] that we described earlier. An improved 
I-Match algorithm appears in [40]. These algorithms 
have been tested on small document-collections (of the 
order of tens of thousands) and appear fairly brittle. 


d) Checksums: Pugh and Henzinger’s patent [47] contains
the following idea: we divide words in a document into k 
buckets (by hashing the words, for example), and com- 
pute a checksum of each bucket. The set of checksums 
of two similar documents should agree for most of the 
buckets. 
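
The min-hash property from item b) can be checked empirically with a short sketch. It assumes that seeded BLAKE2b digests behave like independent random hash functions, so the fraction of matching signature positions only approximates the Jaccard measure; the names `minhash_signature` and `estimated_jaccard`, the signature length of 256 and the two integer sets are illustrative choices, not part of the cited schemes.

```python
import hashlib


def _hash(seed, item):
    # Assumption: seeding the input simulates a family of hash functions.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """One minimum per hash function; items must be a non-empty set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions at which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


A = set(range(0, 80))
B = set(range(40, 120))
true_jaccard = len(A & B) / len(A | B)  # 40 shared elements out of 120

sig_a = minhash_signature(A)
sig_b = minhash_signature(B)
print(true_jaccard, estimated_jaccard(sig_a, sig_b))
```

Each signature position matches with probability equal to the Jaccard measure, so a longer signature gives a tighter estimate at the cost of more storage per document.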


We chose to work with simhash primarily because it allows 
us to work with small-sized fingerprints. 
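
A minimal sketch of a simhash-style fingerprint over a weighted document vector follows. It assumes a 64-bit BLAKE2b digest as the per-feature hash; the function names, the constant `F` and the sample weights are hypothetical, and the code is meant only to convey the idea, not to reproduce our production implementation.

```python
import hashlib

F = 64  # fingerprint width in bits


def simhash(weighted_features):
    """Map a {feature: weight} dictionary to an F-bit fingerprint."""
    counts = [0.0] * F
    for feature, weight in weighted_features.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=F // 8)
        h = int.from_bytes(digest.digest(), "big")
        for i in range(F):
            # A 1-bit in the feature hash votes up, a 0-bit votes down.
            if (h >> i) & 1:
                counts[i] += weight
            else:
                counts[i] -= weight
    fingerprint = 0
    for i in range(F):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit-positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page = {"web": 2.0, "crawl": 1.0, "duplicate": 3.0, "fingerprint": 1.5}
edited = dict(page, advertisement=0.5)  # same page plus one light feature
other = {"stock": 2.0, "prices": 1.0, "asia": 3.0, "markets": 1.5}

print(hamming_distance(simhash(page), simhash(edited)))
print(hamming_distance(simhash(page), simhash(other)))
```

Because each bit depends only on the sign of a weighted sum, lightly weighted features can flip only those bits whose sums are already close to zero, which is why similar vectors tend to yield fingerprints at small Hamming distance.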


5.5 Summary


Most algorithms for near-duplicate detection run in batch- 
mode over the entire collection of documents. For web crawl- 
ing, an online algorithm is necessary because the decision 
to ignore the hyper-links in a recently-crawled page has 
to be made quickly. The scale of the problem (billions of 
documents) limits us to small-sized fingerprints. Luckily, 
Charikar’s simhash technique with 64-bit fingerprints seems 
to work well in practice for a repository of 8B web pages. 
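
The online decision described above can be sketched as a check against the fingerprints seen so far. The sketch below uses a naive linear scan, which is far too slow for billions of fingerprints and is not the search technique developed in this paper; the class name `FingerprintStore`, its methods and the default k = 3 are assumptions chosen for illustration.

```python
class FingerprintStore:
    """Naive store answering: is this fingerprint within k bits of a seen one?"""

    def __init__(self, k=3):
        self.k = k  # maximum number of differing bit-positions (assumed value)
        self.fingerprints = []

    def is_near_duplicate(self, fp):
        # Linear scan: O(n) per query, acceptable only for small collections.
        return any(
            bin(fp ^ seen).count("1") <= self.k for seen in self.fingerprints
        )

    def check_and_add(self, fp):
        """Return True for a near-duplicate; otherwise record fp, return False."""
        if self.is_near_duplicate(fp):
            return True
        self.fingerprints.append(fp)
        return False


store = FingerprintStore(k=3)
for fp in (0b0000, 0b0111, 0b1111):
    if store.check_and_add(fp):
        print(f"{fp:04b}: near-duplicate, ignore its hyper-links")
    else:
        print(f"{fp:04b}: new page, follow its hyper-links")
```

A crawler would call `check_and_add` once per fetched page; replacing the linear scan with an index over the fingerprints is what makes the same decision feasible at web scale.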


6. FUTURE EXPLORATIONS 


Using simhash is a good first step for solving the near- 
duplicate detection problem. Many other ideas hold promise 
of improving the quality of near-duplicate detection, and/or 
making the system more efficient. We list a few: 


A. Document size has been shown to play an important role
in near-duplicate detection in certain contexts. For ex-
ample, in Conrad and Schriber [22], two legal documents
are deemed to be duplicates iff they have 80% overlap in
terminology and ±20% variation in length (these thresholds
were arrived at by consulting the Library Advisory Board,
whose members are trained in the field of Library Science).
Perhaps we should devise different techniques for small and
large documents. Or perhaps we should reserve a few
bits of the 64-bit fingerprint to hold document length.


B. Is it possible to prune the space of existing fingerprints by
asserting that certain documents never have duplicates? 


C. Could we categorize web-pages into different categories
(for example, by language type), and search for near-
duplicates only within the relevant categories?


D. Is it feasible to devise algorithms for detecting portions
of web-pages that contain ads or timestamps? Perhaps
such portions can be automatically removed so that ex- 
act checksums over the remaining page suffice for dupli- 
cate detection. 


E. How sensitive is simhash-based near-duplicate detection
to changes in the algorithm for feature-selection and as- 
signment of weights to features? 


F. How relevant are simhash-based techniques for focused
crawlers [27, 43, 46], which are quite likely to crawl web
pages that are similar to each other?


G. Can near-duplicate detection algorithms be developed
further to facilitate clustering of documents?